Assignment 1

I’m here to learn new skills and improve old ones. The course fits my needs perfectly. I have used R a lot, but I haven’t shared my code outside of our research group, so learning good practices is very welcome. I went through the Exercise 1 ‘warm up’ pretty fast and everything was quite familiar. As a note to myself: the syntax and the functions used were in many ways different from what I am used to for reaching the same goal. It might be good to start using fewer ‘intermediate’ objects and a more ‘tidyverse’ style. I like to use Atom for writing scripts, so it will be nice to try doing things differently.

My github: https://github.com/jvehvi/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Thu Dec  1 00:06:24 2022"

Assignment 2 - Regression and model validation

This week’s topic was regression and model validation. Here we present basic R commands for building summary tables and linear models, which can be used to study relationships between variables (keep in mind that regression alone shows association, not cause and effect).

Data wrangling

PART 1

# Set up working dir
setwd("E:/Open science course 2022/IODS-project") 
dir.create(paste0(getwd(),"/Data"))
## Warning in dir.create(paste0(getwd(), "/Data")): 'E:\Open science course
## 2022\IODS-project\Data' already exists
setwd(paste0(getwd(),"/Data"))

PART 2

Bring the .txt file, which has been downloaded to the Data folder, into the R environment.

The table should have 183 rows and 60 columns, all containing integer values except the last column, which contains gender as a character.
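The reading step itself is not echoed in this diary; below is a minimal sketch with base R’s read.table(), assuming a tab-separated file with a header row (a tiny inline example file stands in here for the downloaded .txt, whose exact name is not shown):

```r
# Stand-in for the downloaded survey file; the real analysis would point
# read.table() at the .txt file inside the Data folder.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Aa\tAb\tgender", "3\t2\tM", "4\t1\tF"), tmp)

# Tab-separated, first row as header: integer columns plus a character gender column
lrn_raw <- read.table(tmp, sep = "\t", header = TRUE)

str(lrn_raw)    # check dimensions and column types
dim(lrn_raw)    # with the real data this should give 183 rows and 60 columns
```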

PART 3

  • Subset the data set to the columns: gender, age, attitude, deep, stra, surf and points.
  • Also remove rows which have 0 in Points.
  • Deep is calculated by taking the mean of the columns c(“D03”,“D11”,“D19”,“D27”,“D03”,“D11”,“D19”,“D27”,“D06”,“D15”,“D23”,“D31”), excluding 0.
  • Stra is calculated by taking the mean of the columns c(“ST01”,“ST09”,“ST17”,“ST25”,“ST04”,“ST12”,“ST20”,“ST28”), excluding 0.
  • Surf is calculated by taking the mean of the columns c(“SU02”,“SU10”,“SU18”,“SU26”,“SU05”,“SU13”,“SU21”,“SU29”,“SU08”,“SU16”,“SU24”,“SU32”), excluding 0.
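The steps above can be sketched on a toy data frame (the column subsets here are shortened stand-ins for the full question lists above):

```r
# Toy stand-in for the full survey table
df <- data.frame(D03 = c(4, 2, 5), D11 = c(3, 3, 4),
                 ST01 = c(2, 4, 3), ST09 = c(1, 5, 2),
                 Points = c(22, 0, 30))

deep_cols <- c("D03", "D11")    # in the real data: the full deep-question list
stra_cols <- c("ST01", "ST09")  # in the real data: the full strategic-question list

# Row-wise means of the question columns (rowMeans from base R)
df$deep <- rowMeans(df[, deep_cols])
df$stra <- rowMeans(df[, stra_cols])

# Drop rows where Points is 0
df_sub <- df[df$Points > 0, ]
```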

PART 4

Analysis

PART 1 - Read file to R environment

The data set contains summary results of the survey for the course ‘Johdatus yhteiskuntatilastotieteeseen, syksy 2014’. There should be 7 variables (gender, age, attitude, deep, stra, surf and points) and 166 observations.

  • Gender: Male = 1, Female = 2
  • Age: Age (in years) derived from the date of birth
  • Attitude: Global attitude toward statistics. Mean of original variables (~Da+Db+Dc+Dd+De+Df+Dg+Dh+Di+Dj)
  • Deep: Deep approach. Mean of original variables (~d_sm+d_ri+d_ue)
  • Stra: Strategic approach. Mean of original variables (~st_os+st_tm)
  • Surf: Surface approach. Mean of original variables (~su_lp+su_um+su_sb)
  • Points: Total points from the survey.

More information about the variables can be found from http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt

## [1] 166   8

PART 2

With the command summary we can take a look ‘inside’ the data: it gives ‘boxplot information’ (minimum, quartiles, median, mean, maximum) for the numerical columns of the data frame. A scatterplot matrix is used to describe relationships between the variables. It is constructed from the data frame with the ggpairs function (GGally package, built on ggplot2). In addition to the pairwise relationships, the resulting plot shows the distribution of each variable and gives correlation coefficients, with asterisks indicating the level of significance.
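The chunk behind the output below isn’t echoed; the numeric part of the same overview can be sketched in base R with summary() and cor() (ggpairs() adds the plot matrix on top of these numbers). The toy data here is simulated, not the real learning2014 data:

```r
# Toy numeric data standing in for the learning2014 variables
set.seed(1)
d <- data.frame(Attitude = runif(50, 1, 5))
d$Points <- 10 + 3 * d$Attitude + rnorm(50, sd = 2)

summary(d)          # quartiles, median, mean per column ('boxplot information')
round(cor(d), 2)    # pairwise correlation coefficients, as in the ggpairs panels
```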

## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
##     Gender               Age           Attitude          Deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.375  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.250  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.750  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.645  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.000  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.875  
##       Stra            Surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

The most promising relationships seem to be between Attitude and Points, and between Surf and Deep.
There also seems to be some kind of relationship between Stra and Surf. Overall, the matrix gives a good overview of the data and a starting point for studying the relationships between the variables further.

PART 3 and PART 4

Create a regression model with multiple explanatory variables
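The model behind the summary below is fitted with lm(); a self-contained sketch on simulated data (the variable names mirror the real ones, but the data and coefficients here are invented for the simulation):

```r
set.seed(42)
n <- 166
sim <- data.frame(Attitude = runif(n, 1, 5),
                  Stra = runif(n, 1, 5),
                  Surf = runif(n, 1, 5))
# In this simulation only Attitude truly drives Points
sim$Points <- 11 + 3.4 * sim$Attitude + rnorm(n, sd = 5)

fit <- lm(Points ~ Attitude + Stra + Surf, data = sim)
summary(fit)   # coefficient table with estimates, std. errors, t- and p-values
```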

## 
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = learning_2014.2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## Attitude      3.3952     0.5741   5.913 1.93e-08 ***
## Stra          0.8531     0.5416   1.575  0.11716    
## Surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The summary of the regression model shows that only Attitude correlates significantly with Points. The print shows the model residuals and a coefficient table (estimate, std. error, t-value and p-value for each variable in the model against Points). The significance of each coefficient can be read from its p-value (last column of the Coefficients table); the significance-level thresholds are given under the table. The summary also gives a p-value for the whole model (via the F-statistic), which here is significant even though the model contains variables (Stra and Surf) that show no significant relationship to Points. In the next step, let’s remove those other variables from the model and see what happens.
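Dropping the non-significant terms can be done by refitting from scratch or with update(); a sketch on simulated data (self-contained stand-in, not the real learning2014 model):

```r
set.seed(7)
sim <- data.frame(Attitude = runif(100, 1, 5), Stra = runif(100, 1, 5),
                  Surf = runif(100, 1, 5))
sim$Points <- 11 + 3.4 * sim$Attitude + rnorm(100, sd = 5)

fit_full <- lm(Points ~ Attitude + Stra + Surf, data = sim)

# Keep only Attitude: '.' stands for the corresponding part of the old formula
fit_simple <- update(fit_full, . ~ Attitude)
summary(fit_simple)
```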

## 
## Call:
## lm(formula = Points ~ Attitude, data = learning_2014.2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## Attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now the results look better: the p-value is significant both for the variable and for the model. Multiple R-squared is the proportion of the variation in the dependent variable that can be explained by the independent variable(s). So in the model with three explanatory variables, 20.74 % of the variation in Points is explained. Interestingly, Attitude by itself explains 19.06 % of the variation, which shows that the contribution of Stra and Surf to Points is pretty minimal.

PART 5 - Study more our model with diagnostic plots

From the ‘Residuals vs Leverage’ plot we can check which and how many observations are influential. In our case the data looks good and there isn’t any point outside the Cook’s distance lines.

The ‘Residuals vs Fitted’ plot also looks good: the points are spread evenly along both the x- and y-axis.

The Q-Q plot also indicates the goodness of our model. If the points depart from the line too early, there might be some other variables affecting the relationship more than the Attitude variable. In this case the Q-Q plot looks really nice, but not perfect.
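The three diagnostic plots discussed above come from base R’s plot() method for lm objects; a sketch on simulated data (which = c(1, 2, 5) selects Residuals vs Fitted, the Q-Q plot, and Residuals vs Leverage):

```r
set.seed(3)
sim <- data.frame(Attitude = runif(80, 1, 5))
sim$Points <- 11 + 3.4 * sim$Attitude + rnorm(80, sd = 5)
fit <- lm(Points ~ Attitude, data = sim)

# Write the plots to a file so this also works in a non-interactive session
png(f <- tempfile(fileext = ".png"))
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 5))  # the three diagnostics used in this assignment
dev.off()
```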


Assignment 3 - Logistic regression

Bring Data to R

For the study material, the Student Performance Data Set (incl. alcohol consumption) from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Student+Performance) has been used as the source data. The joined data set used in the analysis combines the two student alcohol consumption data sets from that page. The variables not used for joining the two data sets have been combined by averaging (including the grade variables):

  • ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’
  • ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise

More information about the variables can be found from: https://archive.ics.uci.edu/ml/datasets/Student+Performance
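The two derived variables can be sketched like this in base R (the Dalc/Walc values here are toy stand-ins; in the real data they are workday/weekend alcohol use on a 1-5 scale):

```r
# Toy stand-in rows for the joined student data
stud <- data.frame(Dalc = c(1, 3, 2), Walc = c(1, 4, 2))

# Average the two consumption columns, then flag heavy users
stud$alc_use  <- (stud$Dalc + stud$Walc) / 2
stud$high_use <- stud$alc_use > 2
stud
```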

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   0.3.5
## Warning: package 'tibble' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'dplyr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# Read data table to R - environment
wrangled_student_mat_por <- as.data.frame(read_csv("Data/wrangled_student_mat_por.csv"))
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(wrangled_student_mat_por)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Select interesting variables

For studying the variables’ relationships to alcohol use, I have selected four variables that I hypothesize to be related to a person’s alcohol consumption.

  • age - I think that persons over 18 use more alcohol.

  • romantic - I think that single persons drink more often.

  • studytime - I think that persons who study more weekly drink less.

  • freetime - I think that persons who have more free time might drink more.

library(ggplot2)
library(GGally)

# Summary table
int_variables<-c("age","romantic","studytime","freetime","alc_use","high_use")
summary(wrangled_student_mat_por[int_variables])
##       age          romantic           studytime        freetime    
##  Min.   :15.00   Length:370         Min.   :1.000   Min.   :1.000  
##  1st Qu.:16.00   Class :character   1st Qu.:1.000   1st Qu.:3.000  
##  Median :17.00   Mode  :character   Median :2.000   Median :3.000  
##  Mean   :16.58                      Mean   :2.043   Mean   :3.224  
##  3rd Qu.:17.00                      3rd Qu.:2.000   3rd Qu.:4.000  
##  Max.   :22.00                      Max.   :4.000   Max.   :5.000  
##     alc_use       high_use      
##  Min.   :1.000   Mode :logical  
##  1st Qu.:1.000   FALSE:259      
##  Median :1.500   TRUE :111      
##  Mean   :1.889                  
##  3rd Qu.:2.500                  
##  Max.   :5.000
# Unique values for romantic column
unique(wrangled_student_mat_por$romantic)
## [1] "no"  "yes"
# Group by romantic before summarise
wrangled_student_mat_por %>% group_by(romantic) %>% summarise(count = n())
## # A tibble: 2 × 2
##   romantic count
##   <chr>    <int>
## 1 no         251
## 2 yes        119
# Scatterplot
p <- ggpairs(wrangled_student_mat_por[int_variables], mapping = aes(), lower =list(combo = wrap("facethist", bins = 20))) 
p

The results show that there seems to be some kind of relationship between alcohol use and at least freetime, studytime and age. A positive correlation can be seen between alcohol use and freetime, and negative correlations between alcohol use and studytime, and between alcohol use and age. An interesting finding is that young persons (15-18) seem to drink more than persons aged 19-21.

Logistic regression

Use glm() to fit a logistic regression model with high_use as the target variable and freetime, studytime, age and romantic as the predictors.

m <- glm(high_use ~ freetime + studytime + age + factor(romantic), data = wrangled_student_mat_por, family = "binomial")

# Summary
summary(m)
## 
## Call:
## glm(formula = high_use ~ freetime + studytime + age + factor(romantic), 
##     family = "binomial", data = wrangled_student_mat_por)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6628  -0.8579  -0.6654   1.1899   2.3332  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -4.4470     1.7762  -2.504 0.012293 *  
## freetime              0.3269     0.1238   2.641 0.008274 ** 
## studytime            -0.5850     0.1572  -3.721 0.000198 ***
## age                   0.2246     0.1025   2.190 0.028512 *  
## factor(romantic)yes  -0.2158     0.2594  -0.832 0.405404    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 421.66  on 365  degrees of freedom
## AIC: 431.66
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                             OR        2.5 %    97.5 %
## (Intercept)         0.01171392 0.0003399901 0.3664590
## freetime            1.38664482 1.0914115708 1.7751196
## studytime           0.55710460 0.4053055149 0.7516885
## age                 1.25176377 1.0261046822 1.5355541
## factor(romantic)yes 0.80586877 0.4806650258 1.3321754

The results show that the widest interval is for the variable ‘romantic’. The interval also contains 1, so most probably alcohol consumption and romantic status have no relationship; rather, the variables are independent of each other. Freetime and age have an odds ratio and confidence interval > 1, so there seems to be a positive association between those and high_use. The studytime values are < 1, indicating a negative association between it and high_use.

Predictive power of the model

Next, I have built a new model excluding ‘romantic’, which seems to have no relationship to students’ alcohol use. After building the model, I have taken summary information about the model and calculated the mean prediction error.

library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Let's select only freetime, studytime and age for the next model
m <- glm(high_use ~ freetime + studytime + age, data = wrangled_student_mat_por, family = "binomial")

# Summary
summary(m)
## 
## Call:
## glm(formula = high_use ~ freetime + studytime + age, family = "binomial", 
##     data = wrangled_student_mat_por)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6143  -0.8690  -0.6895   1.2088   2.2716  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.2829     1.7668  -2.424 0.015345 *  
## freetime      0.3246     0.1236   2.626 0.008630 ** 
## studytime    -0.5928     0.1575  -3.765 0.000167 ***
## age           0.2119     0.1014   2.089 0.036700 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 422.36  on 366  degrees of freedom
## AIC: 430.36
## 
## Number of Fisher Scoring iterations: 4
# Making predictions on the train set.
# Probability
wrangled_student_mat_por <- mutate(wrangled_student_mat_por, probability = predict(m, type = "response"))
# Prediction
wrangled_student_mat_por <- mutate(wrangled_student_mat_por, prediction = probability > 0.5)
# 2x2 cross tabulation of predictions versus the actual values
table(high_use = wrangled_student_mat_por$high_use, prediction = wrangled_student_mat_por$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   245   14
##    TRUE     96   15
# Making predictions again with predict(); note that no separate test set exists here.
test_pred<- ifelse(predict(m, newdata = wrangled_student_mat_por, type = "response") > 0.5, "TRUE", "FALSE")
# 2x2 cross tabulation of predictions versus the actual values
table(high_use = wrangled_student_mat_por$high_use, prediction = test_pred)
##         prediction
## high_use FALSE TRUE
##    FALSE   245   14
##    TRUE     96   15
# We can study the goodness of our model further by looking at the confusionMatrix of the results
confusionMatrix(table(high_use = wrangled_student_mat_por$high_use, prediction = test_pred))
## Confusion Matrix and Statistics
## 
##         prediction
## high_use FALSE TRUE
##    FALSE   245   14
##    TRUE     96   15
##                                           
##                Accuracy : 0.7027          
##                  95% CI : (0.6533, 0.7488)
##     No Information Rate : 0.9216          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1028          
##                                           
##  Mcnemar's Test P-Value : 1.136e-14       
##                                           
##             Sensitivity : 0.7185          
##             Specificity : 0.5172          
##          Pos Pred Value : 0.9459          
##          Neg Pred Value : 0.1351          
##              Prevalence : 0.9216          
##          Detection Rate : 0.6622          
##    Detection Prevalence : 0.7000          
##       Balanced Accuracy : 0.6179          
##                                           
##        'Positive' Class : FALSE           
## 
# Prediction plot
# access dplyr and ggplot2
library(dplyr); library(ggplot2)

# initialize a plot of 'high_use' versus 'probability' in 'wrangled_student_mat_por'
# define the geom as points and draw the plot
g <- ggplot(wrangled_student_mat_por, aes(x = high_use, y =as.numeric(probability))) +
    geom_point(aes(y = as.numeric(probability)), alpha = 0.2)

# ROC curve
library('pROC')
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
test_prob <- predict(m, newdata = wrangled_student_mat_por, type = "response")
test_roc <- roc(wrangled_student_mat_por$high_use ~ test_prob, plot = TRUE, print.auc = TRUE)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases

test_roc
## 
## Call:
## roc.formula(formula = wrangled_student_mat_por$high_use ~ test_prob,     plot = TRUE, print.auc = TRUE)
## 
## Data: test_prob in 259 controls (wrangled_student_mat_por$high_use FALSE) < 111 cases (wrangled_student_mat_por$high_use TRUE).
## Area under the curve: 0.6808
# Define a loss function (mean prediction error)
calc_class_err <- function(actual, predicted) {
  mean(actual != predicted)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
calc_class_err(actual = wrangled_student_mat_por$high_use, predicted = wrangled_student_mat_por$prediction)
## [1] 0.2972973
# call loss_func to compute the average number of wrong predictions in the (test) data
# Error rate should be near each other with training and testing data
calc_class_err(actual = wrangled_student_mat_por$high_use, predicted = test_pred)
## [1] 0.2972973

Our results show that approximately 30 % of our predictions are wrong, so the accuracy of our model is 0.7, which is at least better than a pure guess. We selected 0.5 as the classification threshold; that value is our own choice, and how good it is for accuracy can be studied from the ROC curve. In this study it’s okay. The test error rate is the same as the train error rate (we predicted on the same data), so the model seems to be fitted properly. When the predictions are compared to high_use, we can see that our predictions contain too many low-use (alc_use < 2) drinkers compared to the real situation, and conversely too few high users.

10-fold cross-validation

Let’s perform 10-fold cross-validation for our model to see if we get a better estimate of test set performance. 10-fold means that the observations are split into 10 bins; each bin is used once for validation while the model is trained on the other nine.

library(caret)

#specify the cross-validation method
ctrl <- trainControl(method = "cv", number = 10)

#fit a regression model and use k-fold CV to evaluate performance
model <- train(factor(high_use) ~ freetime + studytime + age, data = wrangled_student_mat_por, family = "binomial", method = "glm", trControl = ctrl)
model
## Generalized Linear Model 
## 
## 370 samples
##   3 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 333, 333, 333, 333, 334, ... 
## Resampling results:
## 
##   Accuracy   Kappa     
##   0.6998815  0.09210631
# Show also summary of model
summary(model)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6143  -0.8690  -0.6895   1.2088   2.2716  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.2829     1.7668  -2.424 0.015345 *  
## freetime      0.3246     0.1236   2.626 0.008630 ** 
## studytime    -0.5928     0.1575  -3.765 0.000167 ***
## age           0.2119     0.1014   2.089 0.036700 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 422.36  on 366  degrees of freedom
## AIC: 430.36
## 
## Number of Fisher Scoring iterations: 4
#View final model parameters
model$finalModel
## 
## Call:  NULL
## 
## Coefficients:
## (Intercept)     freetime    studytime          age  
##     -4.2829       0.3246      -0.5928       0.2119  
## 
## Degrees of Freedom: 369 Total (i.e. Null);  366 Residual
## Null Deviance:       452 
## Residual Deviance: 422.4     AIC: 430.4

The results show that our accuracy is approx. 0.70, meaning that 30 % of our predictions are wrong. Actually, the results are almost the same as with the plain logistic regression model, so in this case 10-fold cross-validation does not change the conclusions, although it gives a more honest estimate of test performance.

Typically, as k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller, but there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically k = 10 or k = 5 is chosen, because these have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

According to the summary table of the model, studytime has the most negative impact on alcohol use, freetime has the most positive correlation with high_use, and the age variable also has a significant effect on drinking habits. Basically, the results seem to be in line with the earlier results.

Perform cross-validation to compare the performance of different logistic regression models

For testing and comparing the performance of different models, I have selected 10 variables as a starting point and removed ‘the worst’ ones one by one. The selected variables are:

  • famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

  • Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

  • Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

  • traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

  • famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

  • sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

  • health - current health status (numeric: from 1 - very bad to 5 - very good)

  • freetime - free time after school (numeric: from 1 - very low to 5 - very high)

  • studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

  • age - student’s age (numeric: from 15 to 22)

The data has been divided into test and training sets, so that we can better estimate how well the models generalize.

# select only columns under interest
wrangled_student_mat_por_sub <- wrangled_student_mat_por[,colnames(wrangled_student_mat_por) %in% c("famsize","Medu","Fedu","traveltime","famrel","sex","health","freetime","studytime","age","high_use")]

# Divide data to test and training sets by using 70 % of the data for the training set
dat <- list()
dat[['index']] <- sample(nrow(wrangled_student_mat_por_sub), size = nrow(wrangled_student_mat_por_sub) * 0.7)
dat[['training']] <- wrangled_student_mat_por_sub[dat$index, ]
dat[['test']] <- wrangled_student_mat_por_sub[-dat$index,]

# Empty lists
count_of_variables <- test_mses <- training_mses <- list()
# For loop to test different counts of variables
for (i in 1:(ncol(wrangled_student_mat_por_sub)- 1)) {
    
    # Build a new model on every iteration, containing one variable less
    # (family = "binomial" is needed here; otherwise glm() fits a linear model)
    train_mod <- glm(high_use ~ ., data = dat$training[, i:ncol(dat$training)], family = "binomial")
    
    # Predict values for training data
    y_hat_training <- predict(train_mod, type = "response")
    train_predict = y_hat_training > 0.5
    # Collect training error
    training_mses[[i]] <- calc_class_err(actual = dat$training$high_use, predicted = train_predict)
    
    # Predict probabilities for the testing data
    # (type = "response" returns probabilities instead of log-odds)
    y_hat_test <- predict(train_mod, newdata = dat$test, type = "response")
    test_predict = y_hat_test > 0.5
    test_mses[[i]] <- calc_class_err(actual = dat$test$high_use, predicted = test_predict)
    
    count_of_variables[[i]] <- length(train_mod$coefficients) - 1
}

# Build dataframe from results to be used in ggplot
df<-data.frame(Number_of_Variables=unlist(count_of_variables), Test_error=unlist(test_mses),
                Training_error=unlist(training_mses))
# Build plot
library("reshape2")
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
df_long<-melt(df, id="Number_of_Variables",)
df_long$Number_of_Variables<-factor(df_long$Number_of_Variables, levels=df$Number_of_Variables)
a<-ggplot(df_long, aes(x = Number_of_Variables,y=value, color=variable, group=variable)) +
    geom_point() +
    geom_line() +
    geom_smooth() +
    scale_x_discrete(limits = rev(levels(df_long$Number_of_Variables)))
a
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In the plot, the test and training errors are shown as separate lines against the number of prediction variables in the model. I selected ten variables which might have an effect on alcohol consumption. Typically, a training error which is smaller than the test error indicates that the model overfits the training data. The model can be considered good enough when the test and training errors are close to each other. For example, from the plot it can be seen that the training error first rises in the range 10-8, while in the same range the test error is lower and remains almost unchanged. Overfitting can be seen in the ranges 7-5 and 3-1. When we have four variables in the model, the test and training errors are near each other, indicating that the model is reasonable. These results seem to change depending on how the data is divided into test and training sets. According to these results, a logistic regression model with four variables seems good, provided the selected variables are good. Different groupings of the selected variables should also be tested to find out which four variables would be the best.


Assignment 4: Clustering and classification

The topic of this assignment is clustering and classification. Basically, clustering means that some points/observations in the data are so near each other, compared to the other points/observations, that they form a cluster of their own. There are multiple clustering methods for finding such clusters in the data. One of the most common is k-means clustering; others worth naming are the hierarchical clustering methods, which give dendrograms as output. If the clustering succeeds, new observations can then be classified into those clusters.
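The idea can be sketched with stats::kmeans() on simulated data with two obvious groups (everything here is synthetic, just to illustrate the concept before the real analysis):

```r
set.seed(123)
# Two well-separated point clouds in 2-D
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

# Ask k-means for two clusters
km <- kmeans(pts, centers = 2)

table(km$cluster)  # cluster sizes; here the 50/50 split is recovered

# A new observation can then be 'classified' to the nearest cluster centre
new_obs <- c(5, 5)
nearest <- which.min(colSums((t(km$centers) - new_obs)^2))
```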

To clarify this topic, the Boston dataset from the MASS package will be used as a demonstration dataset.

The Boston Dataset

The Boston Housing dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The Boston data frame has 506 rows and 14 columns. The following describes the dataset columns:

Summary and graphical overview of the data

## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## corrplot 0.92 loaded
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
## [1] 506  14
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000

There are 506 observations of 14 variables summarising housing in the area of Boston, Massachusetts. With this dataset we can look for relationships between the variables. For example, the relationship of the per capita crime rate to the other variables is very interesting: if we could identify the variables with the strongest effect on it, those variables could be targeted when planning new residential areas, reducing the development of the crime rate in advance. From the summary table we can see that not all variables are easy or wise to use as predictors in the same model as they are now. For example, the observations of "proportion of residential land zoned for lots over 25,000 sq.ft." (zn) are very unevenly distributed, whereas those of "pupil-teacher ratio by town" (ptratio) are much more evenly distributed. Interestingly, when the other variables are plotted against the crime rate, correlations can be seen even where the distribution is not so clear.

From the correlation matrix we can see that per capita crime rate has some degree of positive or negative correlation with every other variable: six variables correlate negatively and seven positively. The weakest negative correlation is with the "Charles River dummy variable (1 if tract bounds river; 0 otherwise)" (chas), and overall the impact of the negatively correlated factors seems low. The positively correlated factors seem to have a much higher impact, and "index of accessibility to radial highways" (rad) and "full-value property-tax rate per $10,000" (tax) correlate most strongly with the per capita crime rate. The same results are visualised in the plot, where conclusions can be drawn more easily.
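
The summaries and the correlation matrix above can be reproduced with a short sketch (assuming the Boston data from the MASS package, as used in the course material):

```r
# Load the Boston housing data from the MASS package
library(MASS)
data("Boston")

# Structure and basic summaries of all 14 variables
head(Boston)
dim(Boston)
summary(Boston)

# Pairwise correlations; e.g. crim vs. rad and crim vs. tax stand out
cor_matrix <- cor(Boston)
round(cor_matrix, 2)
```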

Standardize the dataset

To get a clearer picture of the data and make it usable in a prediction model, it has to be standardized. To do that, we scale and center the columns of the numeric matrix: centering subtracts the column means (omitting NAs) from the corresponding columns, and scaling divides the (centered) columns by their standard deviations.
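
A minimal sketch of the standardization step; scale() returns a matrix, so it is converted back to a data frame:

```r
library(MASS)
data("Boston")

# Center and scale every column: (x - mean(x)) / sd(x)
boston_scaled <- as.data.frame(scale(Boston))

# After scaling, every column has mean 0 and standard deviation 1
summary(boston_scaled$crim)
class(boston_scaled)
```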

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
##         crim         zn      indus       chas        nox        rm        age
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
##        dis        rad        tax    ptratio     black      lstat       medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
## 6 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
## [1] "data.frame"
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000

After scaling we can see that the values of the different variables are on much more comparable ranges. The correlation matrix shows the same results as with the raw values, since correlation is unaffected by linear rescaling.

##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
##         crim         zn      indus       chas        nox        rm        age
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
##        dis        rad        tax    ptratio     black      lstat       medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
## 6 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582
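
The quantiles printed above can be used as break points for a categorical crime variable. A sketch of how such a factor could be created (this is an assumption about the wrangling step, since the exercise below downloads a ready-made boston_scaled.txt instead):

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))

# Use the quantiles of the scaled crime rate as cut points
bins <- quantile(boston_scaled$crim)

# Create a categorical 'crime' variable and drop the original 'crim'
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

table(boston_scaled$crime)
```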

Divide the dataset into train and test sets

  • The data will be divided 80/20 into train and test sets
# Bring Boston data table from github to R

boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt", sep=",", header = T)

# Change 'crime' to factor variable with relevant levels
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))

# Build up test and train datasets
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]
dim(train)
## [1] 404  14
# create test set 
test <- boston_scaled[-ind,]
dim(test)
## [1] 102  14
# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
head(test)
##             zn      indus       chas        nox         rm        age       dis
## 1   0.28454827 -1.2866362 -0.2723291 -0.1440749  0.4132629 -0.1198948 0.1400750
## 2  -0.48724019 -0.5927944 -0.2723291 -0.7395304  0.1940824  0.3668034 0.5566090
## 4  -0.48724019 -1.3055857 -0.2723291 -0.8344581  1.0152978 -0.8090878 1.0766711
## 10  0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.3994130  0.6154813 1.3283202
## 13  0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.5630867 -1.0506607 0.7863653
## 14 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.4776917 -0.2406812 0.4333252
##           rad        tax    ptratio     black      lstat        medv
## 1  -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.15952779
## 2  -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.10142392
## 4  -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.18158864
## 10 -0.5224844 -0.5769480 -1.5037485 0.3289995  0.6227277 -0.39499459
## 13 -0.5224844 -0.5769480 -1.5037485 0.3705134  0.4280788 -0.09055093
## 14 -0.6373311 -0.6006817  1.1753027 0.4406159 -0.6151835 -0.23189977

Now we have built the test and training sets by randomly sampling 80 % of the data as the training set and using the remaining 20 % as the test set. Both sets will be used in the next part. Because the split is random, a fixed seed (set.seed()) would be needed to make the results reproducible.

Fit the linear discriminant analysis on the train set

  • Crime rate as the target variable and all the other variables in the dataset as predictor variables.

Now we use the training set in linear discriminant analysis. The lda() function tries hard to detect if the within-class covariance matrix is singular: if any variable has within-group variance less than tol^2 it stops and reports the variable as constant. The function returns:

  • prior - the prior probabilities used.
  • means - the group means.
  • scaling - a matrix which transforms observations to discriminant functions, normalized so that the within-groups covariance matrix is spherical.
  • svd - the singular values, which give the ratio of the between- and within-group standard deviations on the linear discriminant variables. Their squares are the canonical F-statistics.
  • N - the number of observations used.
  • call - the (matched) function call.

# linear discriminant analysis (lda() comes from the MASS package)
library(MASS)
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2326733 0.2599010 0.2673267 
## 
## Group means:
##                  zn      indus        chas        nox          rm        age
## low       1.0638513 -0.9439680 -0.15056308 -0.9062632  0.42491303 -0.9160303
## med_low  -0.1038547 -0.3122741  0.02085925 -0.5941184 -0.10523925 -0.3086330
## med_high -0.3957690  0.2542119  0.17762524  0.3889174  0.04797559  0.4407640
## high     -0.4872402  1.0169921 -0.09005591  1.0450782 -0.44677556  0.8124500
##                 dis        rad        tax     ptratio       black      lstat
## low       0.9171892 -0.6811386 -0.7569078 -0.45136613  0.38993837 -0.7686159
## med_low   0.3352590 -0.5566940 -0.5048634 -0.07085903  0.30828457 -0.1590263
## med_high -0.3947501 -0.4098252 -0.2818597 -0.24208693  0.05123395  0.0484900
## high     -0.8574771  1.6393984  1.5149640  0.78225547 -0.84938203  0.9561152
##                 medv
## low       0.52371683
## med_low   0.01147008
## med_high  0.12721948
## high     -0.76336738
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.12410623  0.79271264 -0.99140573
## indus    0.06742771 -0.30411192  0.10204949
## chas    -0.10328707 -0.09614625  0.15091678
## nox      0.34675633 -0.60406816 -1.44267983
## rm      -0.08254497 -0.05932853 -0.11017638
## age      0.23715295 -0.32195756 -0.10682770
## dis     -0.11423118 -0.30292760  0.19275347
## rad      3.20726658  1.15668762 -0.07725844
## tax     -0.03069811 -0.30815977  0.69530735
## ptratio  0.10937436  0.00168332 -0.29662316
## black   -0.11176028  0.03471189  0.09556400
## lstat    0.18240992 -0.20323980  0.34250449
## medv     0.12268193 -0.43050565 -0.27697334
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9445 0.0428 0.0127
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

From the results plot we can see that rad seems to be the most driving variable in our training set for explaining LD1. In the result object it has by far the largest positive LD1 coefficient, 3.207, whereas the second most influential variable, nox, has only 0.347. For LD2, zn has the highest positive coefficient, 0.793.

Predict the classes with the LDA model on the test data
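
The predicted classes and the cross-tabulation below can be produced along these lines. This is a self-contained sketch that repeats the earlier steps (scaling, the quantile-based crime factor, and the 80/20 split), with set.seed() added as an assumption for reproducibility:

```r
library(MASS)

# rebuild the scaled data with a categorical crime variable
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

# 80/20 train/test split
set.seed(123)
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
correct_classes <- test$crime
test$crime <- NULL

# fit the LDA model and predict the crime classes of the test set
lda.fit  <- lda(crime ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)

# cross-tabulate correct vs. predicted classes and compute the error rate
conf <- table(correct = correct_classes, predicted = lda.pred$class)
conf
1 - sum(diag(conf)) / sum(conf)
```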

## $class
##   [1] low      med_low  med_low  med_low  med_low  med_low  med_low  med_low 
##   [9] med_low  med_low  med_low  med_low  med_low  med_low  low      med_low 
##  [17] low      med_low  med_high med_low  med_low  med_high med_low  med_low 
##  [25] med_high med_high med_high med_high med_high med_high med_low  med_low 
##  [33] low      low      med_low  med_low  med_low  med_low  med_high med_high
##  [41] low      med_low  med_low  med_low  low      low      med_high med_high
##  [49] med_high med_high low      low      low      med_low  low      low     
##  [57] med_low  low      low      low      med_low  med_low  med_low  med_low 
##  [65] med_low  med_low  med_low  med_low  med_low  med_low  med_low  low     
##  [73] low      low      low      high     high     high     high     high    
##  [81] high     high     high     high     high     high     high     high    
##  [89] high     high     high     high     high     high     high     med_high
##  [97] med_high med_high med_high med_high med_high med_high
## Levels: low med_low med_high high
## 
## $posterior
##              low      med_low     med_high         high
## 1   7.061488e-01 2.541931e-01 3.965804e-02 1.458680e-21
## 2   1.829623e-01 7.512559e-01 6.578177e-02 4.029751e-20
## 4   3.800840e-01 5.485032e-01 7.141275e-02 2.900082e-20
## 10  5.553123e-02 6.787251e-01 2.657437e-01 4.256574e-15
## 13  1.629912e-01 7.261414e-01 1.108674e-01 3.888742e-16
## 14  1.781130e-01 5.023732e-01 3.195139e-01 2.120676e-16
## 17  3.264534e-01 4.919828e-01 1.815639e-01 2.846061e-17
## 31  3.557956e-02 5.317648e-01 4.326556e-01 2.373137e-14
## 33  2.477147e-02 4.959563e-01 4.792723e-01 8.968320e-14
## 45  2.249392e-01 7.551104e-01 1.995035e-02 3.423812e-20
## 47  1.871912e-01 7.973333e-01 1.547551e-02 7.280096e-20
## 48  5.914619e-02 8.909216e-01 4.993219e-02 2.731753e-18
## 49  1.442363e-02 9.324114e-01 5.316498e-02 5.140303e-17
## 63  3.749380e-01 4.294241e-01 1.956379e-01 2.892503e-12
## 66  9.924663e-01 7.436129e-03 9.760624e-05 1.525268e-20
## 75  1.378087e-01 8.458433e-01 1.634808e-02 1.939481e-18
## 93  6.127357e-01 3.279464e-01 5.931787e-02 5.155105e-17
## 99  3.811949e-01 5.422775e-01 7.652751e-02 1.918500e-21
## 101 5.746840e-02 3.604721e-01 5.820595e-01 1.929266e-14
## 110 4.333411e-02 4.843486e-01 4.723173e-01 1.152317e-13
## 111 1.089504e-01 6.077860e-01 2.832636e-01 8.388532e-15
## 116 2.088960e-02 4.735206e-01 5.055898e-01 1.754870e-12
## 117 3.901125e-02 5.250264e-01 4.359623e-01 1.984934e-13
## 118 4.077850e-02 5.170403e-01 4.421812e-01 2.601108e-13
## 122 3.699331e-02 2.097188e-01 7.532879e-01 7.882462e-17
## 126 2.856305e-02 1.809417e-01 7.904952e-01 1.205791e-16
## 143 5.774836e-05 3.021531e-03 9.969207e-01 1.884688e-12
## 151 8.942529e-05 6.888653e-04 9.992217e-01 3.150237e-12
## 160 1.266619e-04 5.249327e-04 9.993484e-01 8.120833e-13
## 164 7.179201e-04 2.875727e-02 9.705248e-01 1.441329e-16
## 176 3.350407e-01 5.530099e-01 1.119494e-01 6.652023e-17
## 182 2.943795e-01 4.912605e-01 2.143600e-01 9.866133e-18
## 193 8.032881e-01 1.822409e-01 1.447104e-02 2.202645e-18
## 198 9.817076e-01 1.800681e-02 2.855552e-04 1.078180e-22
## 206 2.328984e-01 7.146562e-01 5.244536e-02 5.031825e-18
## 208 4.839630e-02 7.476242e-01 2.039795e-01 5.588573e-16
## 212 1.330408e-02 8.604422e-01 1.262537e-01 1.378569e-16
## 213 5.062945e-02 8.764848e-01 7.288580e-02 3.412185e-18
## 224 4.469383e-02 3.245357e-01 6.307704e-01 2.462394e-11
## 238 6.037994e-02 3.212311e-01 6.183890e-01 3.141310e-12
## 244 7.221445e-01 2.718709e-01 5.984648e-03 9.272038e-18
## 245 1.407219e-01 7.610002e-01 9.827792e-02 2.757394e-13
## 249 2.414680e-01 6.676958e-01 9.083612e-02 2.414252e-14
## 250 4.251364e-01 5.375356e-01 3.732806e-02 1.003510e-15
## 251 4.924126e-01 4.833015e-01 2.428595e-02 7.275212e-16
## 257 9.966257e-01 2.877347e-03 4.969417e-04 9.747080e-21
## 258 1.962239e-02 8.522507e-03 9.718551e-01 1.203663e-14
## 259 4.503640e-02 2.576982e-02 9.291938e-01 9.521777e-14
## 261 9.260006e-02 4.824854e-02 8.591514e-01 4.727443e-14
## 269 3.260933e-01 1.256214e-01 5.482853e-01 8.784716e-16
## 273 6.849921e-01 2.760405e-01 3.896745e-02 1.181525e-18
## 276 9.138727e-01 7.297884e-02 1.314851e-02 2.461647e-18
## 279 9.020656e-01 8.886864e-02 9.065789e-03 3.112422e-18
## 283 3.382620e-01 5.356398e-01 1.260982e-01 9.563983e-19
## 284 9.964091e-01 3.406537e-03 1.844027e-04 1.439243e-25
## 290 8.713914e-01 1.243887e-01 4.219887e-03 7.532499e-17
## 297 6.771850e-02 8.868150e-01 4.546646e-02 4.419483e-19
## 303 6.455780e-01 3.367636e-01 1.765838e-02 3.187821e-15
## 306 7.775170e-01 1.279103e-01 9.457268e-02 9.885637e-13
## 308 7.605308e-01 1.197530e-01 1.197162e-01 1.442567e-12
## 313 8.431569e-02 4.784337e-01 4.372506e-01 2.260815e-15
## 318 6.519362e-02 5.907841e-01 3.440222e-01 1.027488e-15
## 319 1.173597e-01 5.042603e-01 3.783800e-01 2.880027e-16
## 323 2.204625e-01 6.615076e-01 1.180300e-01 2.781159e-16
## 326 3.484166e-01 5.850426e-01 6.654071e-02 1.101668e-17
## 327 2.805546e-01 6.326282e-01 8.681724e-02 3.413454e-17
## 330 1.563351e-01 8.330209e-01 1.064401e-02 4.608274e-20
## 331 1.008746e-01 8.864680e-01 1.265742e-02 1.182868e-19
## 341 2.799243e-01 5.234036e-01 1.966721e-01 1.463311e-15
## 343 1.386838e-01 8.399229e-01 2.139328e-02 6.572901e-23
## 347 6.510441e-02 9.186673e-01 1.622831e-02 2.719610e-20
## 348 9.914418e-01 8.136828e-03 4.213229e-04 1.264892e-19
## 350 8.609338e-01 1.356020e-01 3.464270e-03 1.195814e-23
## 351 8.472641e-01 1.495487e-01 3.187123e-03 2.387653e-23
## 352 8.759684e-01 1.207263e-01 3.305250e-03 5.875935e-20
## 361 1.020838e-18 1.593885e-16 7.223143e-12 1.000000e+00
## 370 3.282298e-17 2.775037e-14 4.061455e-10 1.000000e+00
## 373 1.582145e-18 1.825019e-15 5.097079e-11 1.000000e+00
## 388 2.291045e-20 3.641750e-17 6.196130e-14 1.000000e+00
## 396 1.507012e-18 6.151039e-16 2.181443e-12 1.000000e+00
## 399 3.810361e-20 5.477555e-17 8.687434e-14 1.000000e+00
## 408 2.614018e-19 1.294118e-16 9.541308e-13 1.000000e+00
## 421 3.138449e-19 1.090848e-16 1.182178e-12 1.000000e+00
## 422 3.746146e-19 1.285671e-16 9.251375e-13 1.000000e+00
## 437 1.695517e-20 5.657852e-18 1.007612e-13 1.000000e+00
## 444 1.763450e-19 7.795827e-17 1.030890e-12 1.000000e+00
## 446 2.505045e-21 1.676806e-18 3.798580e-14 1.000000e+00
## 448 5.931152e-19 1.938865e-16 1.708308e-12 1.000000e+00
## 463 3.839279e-18 1.260105e-15 8.103064e-12 1.000000e+00
## 465 7.335548e-17 2.611039e-14 3.505876e-11 1.000000e+00
## 467 8.121395e-20 5.954118e-17 3.011785e-13 1.000000e+00
## 478 4.191291e-19 7.162861e-16 3.904523e-13 1.000000e+00
## 480 2.097632e-17 1.027279e-14 1.035790e-11 1.000000e+00
## 483 6.839396e-15 3.140476e-12 5.639709e-10 1.000000e+00
## 486 4.936432e-15 1.952944e-12 3.763156e-10 1.000000e+00
## 489 1.078323e-03 2.797023e-01 7.192194e-01 5.881909e-14
## 493 1.442769e-03 2.289713e-01 7.695860e-01 1.050613e-14
## 496 5.968567e-02 4.504698e-01 4.898445e-01 3.397635e-13
## 499 4.363631e-02 2.898367e-01 6.665270e-01 8.710286e-13
## 501 3.558680e-02 3.023119e-01 6.621013e-01 1.765996e-12
## 503 2.436203e-01 3.437590e-01 4.126208e-01 6.169536e-19
## 504 1.937524e-01 2.249849e-01 5.812626e-01 4.444404e-19
## 
## $x
##            LD1         LD2         LD3
## 1   -3.8272700 -0.08000167 -0.57510788
## 2   -3.4180851 -0.58552433  0.86996911
## 4   -3.4858592 -0.34572920  0.19976289
## 10  -2.0184130 -0.59515752  0.92003593
## 13  -2.3409206  0.02508983  0.86457166
## 14  -2.4275939 -0.48932713 -0.05211165
## 17  -2.6830227 -0.16875913 -0.11810653
## 31  -1.7934510 -0.83223902  0.70010347
## 33  -1.6166883 -0.89624032  0.78311294
## 45  -3.4204021  0.05082209  1.36501289
## 47  -3.3176398  0.16872270  1.64264513
## 48  -2.8599569 -0.50843484  1.78125975
## 49  -2.4407160 -0.83047618  2.50618584
## 63  -1.3272959  0.99496039 -0.05298283
## 66  -3.2851259  3.27003691 -0.96337160
## 75  -2.9176277  0.33858750  1.89151968
## 93  -2.6043291  0.69484853 -0.21102064
## 99  -3.8074570 -0.64504101  0.08550186
## 101 -1.8353306 -0.78341712 -0.02702034
## 110 -1.6161415 -0.63379076  0.52113465
## 111 -1.9720323 -0.27771574  0.49822247
## 116 -1.2561769 -0.69672474  0.86623454
## 117 -1.5476454 -0.58797780  0.69626716
## 118 -1.5178880 -0.54929963  0.66186766
## 122 -2.4445135 -1.61154191 -0.59175636
## 126 -2.3756086 -1.69384853 -0.62350560
## 143 -0.7423706 -3.26278569 -1.45156926
## 151 -0.6463827 -2.98301283 -3.00035641
## 160 -0.8142369 -2.96444637 -3.44283050
## 164 -2.0862185 -3.22580533 -0.74400633
## 176 -2.5775245  0.15178639  0.23757474
## 182 -2.8062280 -0.39477902 -0.18105424
## 193 -2.9338385  1.18631340 -0.25907253
## 198 -3.9293526  2.23600374 -0.79731300
## 206 -2.8545419  0.09954975  0.94761480
## 208 -2.2479800 -0.72949234  1.15080608
## 212 -2.3373016 -1.17512745  2.06730624
## 213 -2.8336206 -0.72982103  1.65588495
## 224 -0.9760473 -0.21572929  0.13055253
## 238 -1.2342845 -0.28544177 -0.05735605
## 244 -2.7536353  1.69084096  0.62917012
## 245 -1.5579285  0.66815462  1.19895767
## 249 -1.8672897  0.69176567  0.81107023
## 250 -2.2432297  1.04103531  0.71305927
## 251 -2.2746665  1.27784711  0.75222766
## 257 -3.3387639  2.48554034 -2.65361893
## 258 -1.6950439 -1.38816018 -3.25441628
## 259 -1.5393863 -0.85879647 -2.53761502
## 261 -1.6842561 -0.61557724 -2.26570835
## 269 -2.2501072 -0.31012069 -1.83393272
## 273 -3.0387635  0.57389030 -0.30820278
## 276 -2.8888247  1.32709243 -1.11073314
## 279 -2.8595230  1.51508316 -0.73416097
## 283 -3.0798327 -0.31841944  0.03844503
## 284 -4.6336876  1.85376557 -2.28912637
## 290 -2.4772126  2.16767366  0.04845034
## 297 -3.0796840 -0.58779001  1.71597030
## 303 -2.0931021  1.69893555  0.49079005
## 306 -1.4277233  1.57602199 -1.16838570
## 308 -1.3849142  1.49430572 -1.32578662
## 313 -2.1134769 -0.71089779  0.14701994
## 318 -2.1952876 -0.78756424  0.55639925
## 319 -2.3730547 -0.71124419  0.06491482
## 323 -2.3945728  0.08959042  0.60238814
## 326 -2.7815994  0.23605703  0.48290422
## 327 -2.6459373  0.12935576  0.55054913
## 330 -3.3546796  0.22611295  1.93767447
## 331 -3.2263704  0.05443686  2.13201563
## 341 -2.2144496  0.11666014  0.06839939
## 343 -4.1377892 -0.80217687  1.49066652
## 347 -3.3833153 -0.39013181  2.20402724
## 348 -3.0739019  2.77945666 -1.54761101
## 350 -4.3215410  0.70875102 -0.16314807
## 351 -4.2410709  0.80674301 -0.00718966
## 352 -3.3143849  1.58092896 -0.04106760
## 361  5.9890698 -0.37351671 -1.93383850
## 370  5.4989658 -1.03945037 -0.74032436
## 373  5.8214256 -1.20609152 -0.84946338
## 388  6.3667568  0.37969276  0.77377929
## 396  5.9423198  0.31031082 -0.27606673
## 399  6.3148394  0.41416639  0.75253924
## 408  6.1192423  0.03654566 -0.51004742
## 421  6.1110886  0.01557771 -0.85606103
## 422  6.1007360  0.19941056 -0.66408911
## 437  6.4460461  0.08799446 -1.04495526
## 444  6.1590187 -0.14485360 -0.83662071
## 446  6.6214767 -0.19270998 -0.81721670
## 448  6.0448502  0.08253728 -0.79631592
## 463  5.8318220  0.04603474 -0.68633219
## 465  5.5159225  0.45884424  0.04631027
## 467  6.2410041  0.13096960 -0.12669833
## 478  6.0463789  0.59750745  1.29276907
## 480  5.6500116  0.55597870  0.35570969
## 483  5.0126051  0.83870076  1.03301614
## 486  5.0588787  0.91330885  0.94265792
## 489 -1.4832861 -2.40024197  1.46637809
## 493 -1.6957866 -2.47588436  1.07322141
## 496 -1.5039417 -0.41031519  0.31834664
## 499 -1.3656110 -0.57772064 -0.07402241
## 501 -1.2727407 -0.59008741  0.07836439
## 503 -3.1241103 -1.04556508 -0.81671313
## 504 -3.1415397 -1.32070493 -1.28056784
## [1] 0.4019608
##           predicted
## correct    low med_low med_high high
##   low       14      13        3    0
##   med_low    6      18        8    0
##   med_high   0      10       10    1
##   high       0       0        0   19

The results show that the high and low groups are assigned to the right classes much better than the med_high and med_low groups. The overall misclassification rate is around 40 % (0.402 above), depending on how the observations are divided into train and test data. This is pretty high and could be better.

Reload the Boston dataset and standardize the dataset

To study the structure of the data further, we reload and standardize the dataset and compute distances between the observations, on which the k-means clustering below is based.

  • Scale all variables to have mean = 0 and standard deviation = 1
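
The distance summaries and cluster counts below can be reproduced with a sketch like this (assuming the MASS Boston data and set.seed() for a reproducible clustering):

```r
library(MASS)

# reload and standardize the Boston data
boston_scaled2 <- as.data.frame(scale(Boston))

# Euclidean distances: straight-line distance between observations
dist_eu <- dist(boston_scaled2)
summary(dist_eu)

# Manhattan distances: sum of absolute coordinate differences
dist_man <- dist(boston_scaled2, method = "manhattan")
summary(dist_man)

# k-means clustering on the scaled data
set.seed(123)
km <- kmeans(boston_scaled2, centers = 4)
summary(km$cluster)
```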

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.269   2.000   4.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.929   2.000   3.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.729   2.000   2.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   3.000   2.666   3.000   5.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   4.312   6.000   6.000

From the summaries, we can see that the distances produced by the two methods differ considerably, with the Manhattan method giving longer distances: Euclidean distance is the straight-line distance between two points, while Manhattan distance is the sum of the absolute coordinate differences along each axis. Two clusters seem best. Using centers = 4, the cluster assignments range from 1 to 4 and cluster 2 is the most common; with a higher value for centers, 4 clusters seems to be the most common choice. So it is hard to select a good number of clusters from the summaries alone. From the plots, however, we can see that 4 clusters could be present in the data, although at least 2 and 3 clusters should also be tried.
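The two distance measures and the cluster-count search discussed above can be sketched like this (total within-cluster sum of squares, TWCSS, is a common criterion: look for the "elbow" where it stops dropping sharply):

```r
library(MASS)

boston_scaled <- scale(Boston)

# Euclidean vs. Manhattan distances over the scaled data
summary(dist(boston_scaled, method = "euclidean"))
summary(dist(boston_scaled, method = "manhattan"))

# Total within-cluster sum of squares for k = 1..10
set.seed(123)
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)

# Elbow plot: a sharp drop followed by a flattening suggests a good k
plot(1:10, twcss, type = "b", xlab = "Number of clusters k", ylab = "Total WCSS")
```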

Bonus: k-means on the original Boston data with 3 clusters

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.743   2.000   3.000
## [1] 404  15
## [1] 102  15
##          crim          zn      indus       chas        nox         rm
## 3  -0.4169290 -0.48724019 -0.5927944 -0.2723291 -0.7395304  1.2814456
## 8  -0.4032966  0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.1603069
## 10 -0.4003331  0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.3994130
## 11 -0.3939564  0.04872402 -0.4761823 -0.2723291 -0.2648919  0.1314594
## 21 -0.2745709 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -1.0171035
## 23 -0.2768170 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.2030044
##           age         dis        rad        tax    ptratio     black      lstat
## 3  -0.2655490 0.556609050 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
## 8   0.9778406 1.023624897 -0.5224844 -0.5769480 -1.5037485 0.4406159  0.9097999
## 10  0.6154813 1.328320207 -0.5224844 -0.5769480 -1.5037485 0.3289995  0.6227277
## 11  0.9138948 1.211779950 -0.5224844 -0.5769480 -1.5037485 0.3926395  1.0918456
## 21  1.0488914 0.001356935 -0.6373311 -0.6006817  1.1753027 0.2179309  1.1716657
## 23  0.8215287 0.086363887 -0.6373311 -0.6006817  1.1753027 0.4406159  0.8495847
##          medv
## 3   1.3229375
## 8   0.4965904
## 10 -0.3949946
## 11 -0.8190411
## 21 -0.9712629
## 23 -0.7972951
## Call:
## lda(N_clusters ~ ., data = train)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.4702970 0.3267327 0.2029703 
## 
## Group means:
##         crim         zn      indus        chas        nox          rm
## 1 -0.3724278 -0.3387499 -0.2829898  0.03849464 -0.3098577 -0.07912173
## 2  0.7707649 -0.4872402  1.1304030  0.05576262  1.1890969 -0.51987167
## 3 -0.4082840  1.5687708 -1.1015857 -0.08027540 -1.0112004  0.88743088
##           age         dis        rad        tax    ptratio      black
## 1  0.02723655  0.01214864 -0.5786988 -0.6052722 -0.1000423  0.2679608
## 2  0.82255145 -0.85370946  1.1680242  1.2695565  0.5468315 -0.5919408
## 3 -1.17655926  1.26178915 -0.6037174 -0.6590027 -0.7421679  0.3540631
##        lstat        medv
## 1 -0.1783253  0.06127032
## 2  0.8633549 -0.69424886
## 3 -0.9375719  0.95206252
## 
## Coefficients of linear discriminants:
##                 LD1          LD2
## crim    -0.05143915  0.128168802
## zn      -0.05754639  1.091843941
## indus    0.49017626  0.071189895
## chas     0.04698341 -0.068524065
## nox      1.06863590  0.719733618
## rm      -0.15043419  0.380911844
## age     -0.12108442 -0.593230736
## dis     -0.12373888  0.564951427
## rad      0.69328145  0.007068937
## tax      0.92825702  0.816155360
## ptratio  0.24143922  0.085061886
## black    0.01470321 -0.013028639
## lstat    0.17568563  0.484228519
## medv    -0.06969142  0.626189685
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8378 0.1622

## integer(0)

The plot of the LDA results shows that tax and rad have the strongest effect on LD1 and age on LD2, making them the most influential linear separators between the clusters.
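The variable arrows used to read off these influences can be added to an LDA scatter plot with a helper like the one below (adapted from the course materials; the function name and defaults are assumptions, and iris is used as a self-contained stand-in dataset):

```r
library(MASS)

# Draw the coefficient vectors of an lda() fit as arrows on an existing plot
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1,
                       color = "red", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)  # matrix of linear discriminant coefficients
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}

# Usage sketch: fit an LDA, plot the discriminant scores, then add the arrows
fit <- lda(Species ~ ., data = iris)
plot(fit, dimen = 2)
lda.arrows(fit, myscale = 2)
```

Longer arrows along an axis indicate variables with larger coefficients for that discriminant.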

SuperBonus

##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
##           zn      indus       chas        nox        rm        age      dis
## 1  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034 0.556609
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100 1.076671
##          rad        tax    ptratio     black      lstat       medv N_clusters
## 1 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278          1
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239          1
## 3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375          1
## 4 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886          3
## 5 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323          3
## 6 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582          1
##   crime
## 1   low
## 2   low
## 3   low
## 4   low
## 5   low
## 6   low
## [1] 404  15
## [1] 102  15
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2500000 0.2574257 0.2425743 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       1.08456771 -0.9369443 -0.11640431 -0.8924742  0.5212608 -0.8766213
## med_low  -0.09115733 -0.3312543  0.03952046 -0.5917217 -0.1115500 -0.3625932
## med_high -0.39571399  0.1510421  0.14409500  0.3846657  0.1429959  0.4294170
## high     -0.48724019  1.0149946 -0.03128211  1.0406720 -0.4855777  0.7957547
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8529692 -0.6987343 -0.7602375 -0.49441664  0.3799540 -0.79526046
## med_low   0.3642410 -0.5554602 -0.5409951 -0.04120056  0.3201782 -0.17644351
## med_high -0.3749755 -0.3822389 -0.2825025 -0.26326606  0.1074278 -0.05752024
## high     -0.8584374  1.6596029  1.5294129  0.80577843 -0.7464259  0.89613928
##                medv
## low       0.5929402
## med_low   0.0102126
## med_high  0.2096063
## high     -0.6231054
## 
## Coefficients of linear discriminants:
##                 LD1         LD2        LD3
## zn       0.11030024  0.90477115 -0.9432520
## indus   -0.02369693 -0.21284527  0.3534688
## chas    -0.06681638 -0.04469796  0.1555483
## nox      0.32271479 -0.76681432 -1.3321508
## rm      -0.11928537 -0.08149308 -0.1426609
## age      0.35340567 -0.29789983 -0.2669457
## dis     -0.08563196 -0.47438844  0.2270172
## rad      3.09835255  1.17955576  0.1675980
## tax      0.10830898 -0.36338770  0.2630112
## ptratio  0.15263545  0.06163493 -0.2299144
## black   -0.17731254  0.01064261  0.1035738
## lstat    0.14115142 -0.15376102  0.3960080
## medv     0.18691351 -0.36944033 -0.1857561
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9436 0.0416 0.0148
## [1] 404  13
## [1] 13  3
## Warning: package 'plotly' was built under R version 4.2.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Assignment 5: Dimensionality reduction techniques

Data wrangling

Data wrangling was started last week; the script can be found in the file create_human.R. There should be 195 observations and 19 variables. Most of the variable names have been shortened and a couple of new variables added to the original dataset, as follows:

  • “GNI” = Gross National Income per capita
  • “Life.Exp” = Life expectancy at birth
  • “Edu.Exp” = Expected years of schooling
  • “Mat.Mor” = Maternal mortality ratio
  • “Ado.Birth” = Adolescent birth rate

  • “Parli.F” = Percetange of female representatives in parliament
  • “Edu2.F” = Proportion of females with at least secondary education
  • “Edu2.M” = Proportion of males with at least secondary education
  • “Labo.F” = Proportion of females in the labour force
  • “Labo.M” = Proportion of males in the labour force

New variables:

  • “Edu2.FM” = Edu2.F / Edu2.M
  • “Labo.FM” = Labo.F / Labo.M

There is still some editing to do, e.g. the Gender Inequality Index column could also be shortened to ‘GII’. Let’s start wrangling with that.

# Read file to R-environment
human_data<-read.table("human_data.tsv", header=TRUE, sep=";", dec=",")


# EDIT Gender Inequality Index column
colnames(human_data)[colnames(human_data)=="Gender Inequality Index"]<-"GII"

# Libraries
library(dplyr)

# Change GNI col to numeric by mutate
human_data$GNI<-human_data %>%
  dplyr::select(GNI) %>%
  mutate(GNI =as.numeric(GNI))


# Exclude unneeded variables
keep <- c("Country", "Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
human_data.subset <- human_data[,keep]

# Mutating 'GNI' the way above assigns a data frame to the column, which would be problematic
# in plots, so let's convert it back to a plain numeric vector
human_data.subset$GNI<-as.numeric(unlist(human_data.subset$GNI))

# Remove rows with all NA
human_data.subset<-human_data.subset[rowSums(is.na(human_data.subset))!= ncol(human_data.subset),]
dim(human_data.subset)
## [1] 195   9
# Seems that there aren't any all-NA rows
# Let's remove rows that contain even one NA
human_data.subset<-human_data.subset[complete.cases(as.matrix(human_data.subset)), ]
dim(human_data.subset)
## [1] 162   9
# After subsetting, 162/195 observations remain, i.e. those without any missing values

# Next we should remove observations that are regions rather than countries;
# to find out which ones, list the unique values of the Country column

unique(human_data.subset$Country)
##   [1] "Afghanistan"                              
##   [2] "Albania"                                  
##   [3] "Algeria"                                  
##   [4] "Arab States"                              
##   [5] "Argentina"                                
##   [6] "Armenia"                                  
##   [7] "Australia"                                
##   [8] "Austria"                                  
##   [9] "Azerbaijan"                               
##  [10] "Bahamas"                                  
##  [11] "Bahrain"                                  
##  [12] "Bangladesh"                               
##  [13] "Barbados"                                 
##  [14] "Belarus"                                  
##  [15] "Belgium"                                  
##  [16] "Belize"                                   
##  [17] "Benin"                                    
##  [18] "Bhutan"                                   
##  [19] "Bolivia (Plurinational State of)"         
##  [20] "Bosnia and Herzegovina"                   
##  [21] "Botswana"                                 
##  [22] "Brazil"                                   
##  [23] "Bulgaria"                                 
##  [24] "Burkina Faso"                             
##  [25] "Burundi"                                  
##  [26] "Cambodia"                                 
##  [27] "Cameroon"                                 
##  [28] "Canada"                                   
##  [29] "Central African Republic"                 
##  [30] "Chad"                                     
##  [31] "Chile"                                    
##  [32] "China"                                    
##  [33] "Colombia"                                 
##  [34] "Congo"                                    
##  [35] "Congo (Democratic Republic of the)"       
##  [36] "Costa Rica"                               
##  [37] "Côte d'Ivoire"                            
##  [38] "Croatia"                                  
##  [39] "Cuba"                                     
##  [40] "Cyprus"                                   
##  [41] "Czech Republic"                           
##  [42] "Denmark"                                  
##  [43] "Dominican Republic"                       
##  [44] "East Asia and the Pacific"                
##  [45] "Ecuador"                                  
##  [46] "Egypt"                                    
##  [47] "El Salvador"                              
##  [48] "Estonia"                                  
##  [49] "Ethiopia"                                 
##  [50] "Europe and Central Asia"                  
##  [51] "Fiji"                                     
##  [52] "Finland"                                  
##  [53] "France"                                   
##  [54] "Gabon"                                    
##  [55] "Gambia"                                   
##  [56] "Georgia"                                  
##  [57] "Germany"                                  
##  [58] "Ghana"                                    
##  [59] "Greece"                                   
##  [60] "Guatemala"                                
##  [61] "Guyana"                                   
##  [62] "Haiti"                                    
##  [63] "Honduras"                                 
##  [64] "Hungary"                                  
##  [65] "Iceland"                                  
##  [66] "India"                                    
##  [67] "Indonesia"                                
##  [68] "Iran (Islamic Republic of)"               
##  [69] "Iraq"                                     
##  [70] "Ireland"                                  
##  [71] "Israel"                                   
##  [72] "Italy"                                    
##  [73] "Jamaica"                                  
##  [74] "Japan"                                    
##  [75] "Jordan"                                   
##  [76] "Kazakhstan"                               
##  [77] "Kenya"                                    
##  [78] "Korea (Republic of)"                      
##  [79] "Kuwait"                                   
##  [80] "Kyrgyzstan"                               
##  [81] "Latin America and the Caribbean"          
##  [82] "Latvia"                                   
##  [83] "Lebanon"                                  
##  [84] "Lesotho"                                  
##  [85] "Liberia"                                  
##  [86] "Libya"                                    
##  [87] "Lithuania"                                
##  [88] "Luxembourg"                               
##  [89] "Malawi"                                   
##  [90] "Malaysia"                                 
##  [91] "Maldives"                                 
##  [92] "Mali"                                     
##  [93] "Malta"                                    
##  [94] "Mauritania"                               
##  [95] "Mauritius"                                
##  [96] "Mexico"                                   
##  [97] "Moldova (Republic of)"                    
##  [98] "Mongolia"                                 
##  [99] "Montenegro"                               
## [100] "Morocco"                                  
## [101] "Mozambique"                               
## [102] "Myanmar"                                  
## [103] "Namibia"                                  
## [104] "Nepal"                                    
## [105] "Netherlands"                              
## [106] "New Zealand"                              
## [107] "Nicaragua"                                
## [108] "Niger"                                    
## [109] "Norway"                                   
## [110] "Oman"                                     
## [111] "Pakistan"                                 
## [112] "Panama"                                   
## [113] "Papua New Guinea"                         
## [114] "Paraguay"                                 
## [115] "Peru"                                     
## [116] "Philippines"                              
## [117] "Poland"                                   
## [118] "Portugal"                                 
## [119] "Qatar"                                    
## [120] "Romania"                                  
## [121] "Russian Federation"                       
## [122] "Rwanda"                                   
## [123] "Samoa"                                    
## [124] "Saudi Arabia"                             
## [125] "Senegal"                                  
## [126] "Serbia"                                   
## [127] "Sierra Leone"                             
## [128] "Singapore"                                
## [129] "Slovakia"                                 
## [130] "Slovenia"                                 
## [131] "South Africa"                             
## [132] "South Asia"                               
## [133] "Spain"                                    
## [134] "Sri Lanka"                                
## [135] "Sub-Saharan Africa"                       
## [136] "Sudan"                                    
## [137] "Suriname"                                 
## [138] "Swaziland"                                
## [139] "Sweden"                                   
## [140] "Switzerland"                              
## [141] "Syrian Arab Republic"                     
## [142] "Tajikistan"                               
## [143] "Tanzania (United Republic of)"            
## [144] "Thailand"                                 
## [145] "The former Yugoslav Republic of Macedonia"
## [146] "Togo"                                     
## [147] "Tonga"                                    
## [148] "Trinidad and Tobago"                      
## [149] "Tunisia"                                  
## [150] "Turkey"                                   
## [151] "Uganda"                                   
## [152] "Ukraine"                                  
## [153] "United Arab Emirates"                     
## [154] "United Kingdom"                           
## [155] "United States"                            
## [156] "Uruguay"                                  
## [157] "Venezuela (Bolivarian Republic of)"       
## [158] "Viet Nam"                                 
## [159] "World"                                    
## [160] "Yemen"                                    
## [161] "Zambia"                                   
## [162] "Zimbabwe"
# From the print:
regions<-c("Arab States","South Asia","Latin America and the Caribbean","Europe and Central Asia","East Asia and the Pacific","World","Sub-Saharan Africa")

# Subset regions from the human_data
human_data.subset <- subset(human_data.subset, !(Country %in% regions))

# We can check from the dimensions and unique values that the correct rows have been removed
dim(human_data.subset)
## [1] 155   9
# add country to rownames
rownames(human_data.subset) <- human_data.subset$Country

human_data.subset <- human_data.subset[,!(colnames(human_data.subset) %in% c('Country'))]

# Check dimensions, should be 155 x 8
dim(human_data.subset)
## [1] 155   8
write.csv2(human_data.subset, "human_data_30112022.tsv", row.names=TRUE)
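Note that write.csv2() writes semicolon-separated values with a comma decimal mark, despite the .tsv suffix used above, so the matching reader is read.csv2(). A self-contained round-trip sketch (the small data frame here is illustrative, not the actual data):

```r
# Toy data frame with row names, mimicking the country-indexed data above
df <- data.frame(a = c(1.5, 2.5), b = c(3.25, 4.75), row.names = c("x", "y"))
f <- tempfile(fileext = ".tsv")  # content is ';'-separated despite the suffix
write.csv2(df, f, row.names = TRUE)

# read.csv2 expects ';' separators and ',' decimals; row names are in column 1
back <- read.csv2(f, row.names = 1)
all.equal(df, back)  # TRUE
```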

Analysis

This week’s topic is dimensionality reduction techniques.

Summary of the data

# Libraries
library(tidyverse)
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.2.2
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
library(ggpubr)
library(MASS)
library(ggplot2)
library(GGally)
library(reshape2)
library(tidyr)
library(corrplot)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.2
# Let's try a new package for producing a summary table and plots of the data
library(modelsummary)
## Warning: package 'modelsummary' was built under R version 4.2.2
# head of data
head(human_data.subset)
##               Edu2.FM   Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth
## Afghanistan 0.1987421 0.1979866     9.3     60.4  1885     400      86.8
## Albania     0.6854962 0.9306030    11.8     77.8  9943      21      15.3
## Algeria     0.2105263 0.8612903    14.0     74.8 13054      89      10.0
## Arab States 0.3081009 0.7289916    12.0     70.6 15722     155      45.4
## Argentina   0.6333333 0.9774306    17.9     76.3 22050      69      54.4
## Armenia     0.7465565 0.9894737    12.3     74.7  8124      29      27.1
##             Parli.F
## Afghanistan    27.6
## Albania        20.7
## Algeria        25.7
## Arab States    14.0
## Argentina      36.8
## Armenia        10.7
# Summary of the Human Data
datasummary_skim(human_data.subset)
           Unique (#)  Missing (%)     Mean       SD    Min   Median       Max
Edu2.FM           155            0      0.7      0.2    0.2      0.8       1.0
Labo.FM           151            0      0.9      0.2    0.2      0.9       1.5
Edu.Exp            85            0     13.2      2.8    5.4     13.5      20.2
Life.Exp          112            0     71.7      8.3   49.0     74.2      83.5
GNI               155            0  17651.1  18539.2  581.0  12040.0  123124.0
Mat.Mor            88            0    149.2    211.8    1.0     49.0    1100.0
Ado.Birth         139            0     47.1     41.1    0.6     33.6     204.8
Parli.F           123            0     20.7     11.4    0.0     19.3      57.5
# Summary plot
## Scatterplot
ggpairs(human_data.subset, mapping = aes(), lower =list(combo = wrap("facethist", bins = 20))) 

# Correlation plot
datasummary_correlation(human_data.subset)
           Edu2.FM  Labo.FM  Edu.Exp  Life.Exp   GNI  Mat.Mor  Ado.Birth  Parli.F
Edu2.FM          1        .        .         .     .        .          .        .
Labo.FM        .02        1        .         .     .        .          .        .
Edu.Exp        .05      .59        1         .     .        .          .        .
Life.Exp      -.14      .59      .80         1     .        .          .        .
GNI           -.02      .43      .62       .63     1        .          .        .
Mat.Mor        .24     -.66     -.74      -.87  -.50        1          .        .
Ado.Birth      .12     -.53     -.70      -.74  -.56      .76          1        .
Parli.F        .26      .08      .21       .19   .09     -.09       -.07        1

There are differences between the variables in scale and distribution. For example, our new variables combine information from two columns into ratios, while GNI has the highest values and the widest range. Overall, values span from roughly 0 to 123124, meaning that some sort of scaling and normalization should be done to make the variables comparable.

PCA raw data

Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data. Formally, PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data. (wikipedia)

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_data.subset)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


From this biplot we can’t see much, nor use it for a good interpretation. About the only thing we can say is that GNI dominates PC1.
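The reason GNI dominates the unscaled result is that PCA maximizes variance, so the variable with by far the largest variance captures PC1 almost alone. A toy sketch with two variables on very different scales (names and values are illustrative):

```r
set.seed(1)
# 'b' has a standard deviation 1000 times that of 'a'
x <- data.frame(a = rnorm(100), b = rnorm(100, sd = 1000))
p <- prcomp(x)

# PC1 explains essentially all variance and is essentially just 'b'
summary(p)$importance["Proportion of Variance", "PC1"]  # close to 1
```

This is exactly why the variables are standardized before the next PCA run.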

PCA standardized data

# standardize the variables
human_std <- scale(human_data.subset)

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)

After scaling, the variables are much easier to distinguish. Edu.Exp, Life.Exp, Labo.FM and GNI load on PC1 and form one group, while Mat.Mor and Ado.Birth also load on PC1 but in the opposite direction, forming another group. Reading the plot: Niger and Sierra Leone have high Mat.Mor and Ado.Birth and low GNI, Labo.FM, Life.Exp and Edu.Exp, which sounds realistic. For countries such as Japan, Korea and the Czech Republic, the values are on the opposite side of the mean/median.
There should be negative correlation between these two groups and positive correlation among the variables within each group; this can be checked against the correlation values in the summary plots built earlier. Overall, these groups of similarly loading variables divide the countries into groups with a similar standard of living. PC2 is influenced by Parli.F and Edu2.FM, which also correlate positively with each other. It shows that countries such as Bolivia, Rwanda and Namibia have a similar level of education between genders, and that this correlates positively with the percentage of female representatives in parliament. At the opposite end, Jordan has a large gap between genders in education level, and its percentage of female representatives in parliament is also low.
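When reading loadings like this, it helps to label the biplot axes with the percentage of variance each component explains. A sketch of the recipe (the same steps would apply to the pca_human object above; iris is used here as a self-contained stand-in):

```r
# PCA on standardized data
p <- prcomp(scale(iris[, 1:4]))

# Percentage of variance explained by each PC, rounded to one decimal
pr <- round(100 * summary(p)$importance["Proportion of Variance", ], 1)
labs <- paste0(names(pr), " (", pr, "%)")

# Biplot with informative axis labels
biplot(p, xlab = labs[1], ylab = labs[2])
```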

Multiple Correspondence Analysis (MCA)

In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA can be viewed as an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables.(wikipedia)

# Bring Tea data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

# Study data
View(tea)
# Dimensions
dim(tea)
## [1] 300  36
# Colnames
colnames(tea)
##  [1] "breakfast"        "tea.time"         "evening"          "lunch"           
##  [5] "dinner"           "always"           "home"             "work"            
##  [9] "tearoom"          "friends"          "resto"            "pub"             
## [13] "Tea"              "How"              "sugar"            "how"             
## [17] "where"            "price"            "age"              "sex"             
## [21] "SPC"              "Sport"            "age_Q"            "frequency"       
## [25] "escape.exoticism" "spirituality"     "healthy"          "diuretic"        
## [29] "friendliness"     "iron.absorption"  "feminine"         "sophisticated"   
## [33] "slimming"         "exciting"         "relaxing"         "effect.on.health"
# Summary
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
# Summary plot
# Let's visualize a subset of the variables
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
## Scatterplot matrix
ggpairs(tea[, colnames(tea) %in% keep_columns],
        lower = list(combo = wrap("facethist", bins = 20)))

keep_columns_2 <- c("price", "Tea")
# Mosaic plot of two of the variables
mosaicplot(table(tea[, colnames(tea) %in% keep_columns_2]),
           xlab = 'Tea', ylab = 'Price', main = 'Tea and price', col = 'blue')

# MCA works on categorical variables, so the numeric age column must either be
# binned into a factor or removed. The data already contains age_Q, so we drop age.
tea <- tea[!(colnames(tea) %in% 'age')]
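As an alternative to dropping the column, age could be binned into a factor. A minimal base-R sketch with a made-up age vector; the breaks mirror the age_Q classes shown in the summary above:

```r
# Hypothetical example: bin numeric ages into the same classes as age_Q
age <- c(17, 23, 31, 44, 52, 67)
age_binned <- cut(age,
                  breaks = c(14, 24, 34, 44, 59, Inf),
                  labels = c("15-24", "25-34", "35-44", "45-59", "+60"))
as.character(age_binned)
# [1] "15-24" "15-24" "25-34" "35-44" "45-59" "+60"
```

Note that `cut()` uses right-closed intervals by default, so 44 falls into "35-44".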

# Multiple Correspondence Analysis (MCA)
# graph = TRUE also visualizes the correlation between the variables and the MCA principal dimensions
res.mca <- MCA(tea, graph = TRUE)

## Warning: ggrepel: 29 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# Extract and save results for later use
var <- get_mca_var(res.mca)
var
## Multiple Correspondence Analysis Results for variables
##  ===================================================
##   Name       Description                  
## 1 "$coord"   "Coordinates for categories" 
## 2 "$cos2"    "Cos2 for categories"        
## 3 "$contrib" "contributions of categories"
# Coordinates
head(var$coord)
##                    Dim 1        Dim 2       Dim 3       Dim 4       Dim 5
## breakfast      0.1818793  0.019872372 -0.10740185 -0.55341901 -0.09826250
## Not.breakfast -0.1678886 -0.018343728  0.09914017  0.51084831  0.09070384
## Not.tea time  -0.5562500  0.004261602  0.06242859  0.09798477  0.16970005
## tea time       0.4311760 -0.003303372 -0.04839139 -0.07595269 -0.13154265
## evening        0.2760859 -0.408578586  0.34396723  0.23142519 -0.26420301
## Not.evening   -0.1443495  0.213622307 -0.17984073 -0.12099896  0.13813660
# Cos2: quality on the factor map
head(var$cos2)
##                    Dim 1        Dim 2       Dim 3       Dim 4       Dim 5
## breakfast     0.03053547 3.645334e-04 0.010647837 0.282713168 0.008912786
## Not.breakfast 0.03053547 3.645334e-04 0.010647837 0.282713168 0.008912786
## Not.tea time  0.23984168 1.407766e-05 0.003021007 0.007442207 0.022322794
## tea time      0.23984168 1.407766e-05 0.003021007 0.007442207 0.022322794
## evening       0.03985286 8.728150e-02 0.061859319 0.028002208 0.036496104
## Not.evening   0.03985286 8.728150e-02 0.061859319 0.028002208 0.036496104
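A useful sanity check on cos2: for each category, the cos2 values sum to 1 over all dimensions, because cos2 on an axis is the squared coordinate divided by the category's squared distance to the origin. A small sketch with made-up coordinates:

```r
# Made-up coordinates of one category across three dimensions
coord <- c(0.6, -0.3, 0.2)
cos2  <- coord^2 / sum(coord^2)  # quality of representation on each axis
sum(cos2)                        # sums to 1 across all dimensions
# [1] 1
```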
# Contributions to the principal components
head(var$contrib)     
##                   Dim 1        Dim 2      Dim 3     Dim 4     Dim 5
## breakfast     0.5036943 0.0066328082 0.22530472 6.7101105 0.2373656
## Not.breakfast 0.4649486 0.0061225922 0.20797359 6.1939482 0.2191067
## Not.tea time  4.2859707 0.0002774934 0.06925046 0.1913584 0.6440432
## tea time      3.3222613 0.0002150984 0.05367935 0.1483310 0.4992288
## evening       0.8301633 2.0055059548 1.65293457 0.8393006 1.2274180
## Not.evening   0.4340448 1.0485640271 0.86422467 0.4388221 0.6417465
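The contributions have a similar structure: on each axis a category's contribution is its mass times its squared coordinate, divided by the axis eigenvalue and expressed in percent, so the contributions of all categories sum to 100 per axis. A toy sketch with made-up masses and coordinates:

```r
mass  <- c(0.5, 0.3, 0.2)             # category masses (sum to 1)
coord <- c(0.2, -0.4, 0.5)            # category coordinates on one axis
eig   <- sum(mass * coord^2)          # the axis eigenvalue (its inertia)
contrib <- 100 * mass * coord^2 / eig
sum(contrib)                          # contributions sum to 100
# [1] 100
```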
# Visualize the MCA - many possibilities
# MCA factor map
plot(res.mca, invisible = c("ind"), graph.type = "classic")

# Eigenvalues/Variances
# get values
eig.val <- get_eigenvalue(res.mca)
head(eig.val)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 0.09006849         5.837772                    5.837772
## Dim.2 0.08165357         5.292361                   11.130133
## Dim.3 0.07021443         4.550936                   15.681069
## Dim.4 0.06259673         4.057196                   19.738264
## Dim.5 0.05578674         3.615807                   23.354071
## Dim.6 0.05345502         3.464677                   26.818748
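The columns of this table are simple transformations of the eigenvalues: variance.percent is each eigenvalue as a share of the total inertia, and the last column is the running sum. A sketch with a made-up full set of eigenvalues:

```r
eig <- c(0.40, 0.25, 0.20, 0.10, 0.05)  # hypothetical full set of eigenvalues
variance_pct <- 100 * eig / sum(eig)    # variance.percent column
cumulative   <- cumsum(variance_pct)    # cumulative.variance.percent column
cumulative[2]                           # share explained by the first two dimensions
# [1] 65
```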
# Build scree plot
# Shows how much of the variance the first ten dimensions of the data explain
fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))

# Biplot
# The plot below shows a global pattern within the data. Rows (individuals) are represented by blue points and columns (variable categories) by red triangles.

# The distance between any row points or column points gives a measure of their similarity (or dissimilarity). Row points with a similar profile are close to each other on the factor map. The same holds true for column points.
fviz_mca_biplot(res.mca, 
               repel = TRUE,
               ggtheme = theme_minimal())
## Warning: ggrepel: 279 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 59 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# Cos2
# The quality of the representation is called the squared cosine (cos2), which measures the degree of association between variable categories and a particular axis. 

# Color by cos2 values: quality on the factor map
fviz_mca_var(res.mca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # Avoid text overlapping
             ggtheme = theme_minimal())
## Warning: ggrepel: 61 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# Contribution of variable categories to the dimensions 1-2
fviz_contrib(res.mca, choice = "var", axes = 1:2, top = 15)

# The red dashed line on the graph above indicates the expected average value if the contributions were uniform.
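For a single axis that uniform reference is simply 100 divided by the number of active categories. A trivial sketch with a hypothetical category count:

```r
n_categories <- 20        # hypothetical number of active categories
100 / n_categories        # expected average contribution (%) under uniformity
# [1] 5
```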
fviz_mca_var(res.mca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # avoid text overlapping (slow)
             ggtheme = theme_minimal()
             )
## Warning: ggrepel: 61 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# The plot above gives an idea of which pole of the dimensions the categories are actually contributing to.

(More chapters will be added in the same style as we proceed with the course!)